{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mini-batching"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In its purest form, online machine learning encompasses models which learn with one sample at a time. This is the design which is used in River.\n",
    "\n",
    "The main downside of single-instance processing is that it doesn't scale to big data, at least not in the sense of traditional batch learning. Indeed, processing one sample at a time means that we are unable to fully take advantage of [vectorisation](https://www.wikiwand.com/en/Vectorization) and other computational tools that are taken for granted in batch learning. On top of this, processing a large dataset in River essentially involves a Python `for` loop, which might be too slow for some usecases. However, this doesn't mean that River is slow. In fact, for processing a single instance, River is actually a couple of orders of magnitude faster than libraries such as scikit-learn, PyTorch, and Tensorflow. The reason why is because River is designed from the ground up to process a single instance, whereas the majority of other libraries choose to care about batches of data. Both approaches offer different compromises, and the best choice depends on your usecase.\n",
    "\n",
    "In order to propose the best of both worlds, River offers some limited support for mini-batch learning. Some of River's estimators implement `*_many` methods on top of their `*_one` counterparts. For instance, `preprocessing.StandardScaler` has a `learn_many` method as well as a `transform_many` method, in addition to `learn_one` and `transform_one`. Each mini-batch method takes as input a `pandas.DataFrame`. Supervised estimators also take as input a `pandas.Series` of target values. We choose to use `pandas.DataFrames` over `numpy.ndarrays` because of the simple fact that the former allows us to name each feature. This in turn allows us to offer a uniform interface for both single instance and mini-batch learning.\n",
    "\n",
    "As an example, we will build a simple pipeline that scales the data and trains a logistic regression. Indeed, the `compose.Pipeline` class can be applied to mini-batches, as long as each step is able to do so."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:48:50.784905Z",
     "iopub.status.busy": "2023-12-04T17:48:50.784206Z",
     "iopub.status.idle": "2023-12-04T17:48:51.085175Z",
     "shell.execute_reply": "2023-12-04T17:48:51.084724Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from river import compose\n",
    "from river import linear_model\n",
    "from river import preprocessing\n",
    "\n",
    "model = compose.Pipeline(\n",
    "    preprocessing.StandardScaler(),\n",
    "    linear_model.LogisticRegression()\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this example, we will use `datasets.Higgs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:48:51.087193Z",
     "iopub.status.busy": "2023-12-04T17:48:51.087065Z",
     "iopub.status.idle": "2023-12-04T17:48:51.238054Z",
     "shell.execute_reply": "2023-12-04T17:48:51.237686Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Higgs dataset.\n",
       "\n",
       "The data has been produced using Monte Carlo simulations. The first 21 features (columns 2-22)\n",
       "are kinematic properties measured by the particle detectors in the accelerator. The last seven\n",
       "features are functions of the first 21 features; these are high-level features derived by\n",
       "physicists to help discriminate between the two classes.\n",
       "\n",
       "      Name  Higgs                                                                       \n",
       "      Task  Binary classification                                                       \n",
       "   Samples  11,000,000                                                                  \n",
       "  Features  28                                                                          \n",
       "    Sparse  False                                                                       \n",
       "      Path  /Users/max/river_data/Higgs/HIGGS.csv.gz                                    \n",
       "       URL  https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz\n",
       "      Size  2.62 GB                                                                     \n",
       "Downloaded  True                                                                        "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from river import datasets\n",
    "\n",
    "dataset = datasets.Higgs()\n",
    "if not dataset.is_downloaded:\n",
    "    dataset.download()\n",
    "dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The easiest way to read the data in a mini-batch fashion is to use the `read_csv` from `pandas`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:48:51.239810Z",
     "iopub.status.busy": "2023-12-04T17:48:51.239681Z",
     "iopub.status.idle": "2023-12-04T17:48:53.685265Z",
     "shell.execute_reply": "2023-12-04T17:48:53.683815Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "names = [\n",
    "    'target', 'lepton pT', 'lepton eta', 'lepton phi',\n",
    "    'missing energy magnitude', 'missing energy phi',\n",
    "    'jet 1 pt', 'jet 1 eta', 'jet 1 phi', 'jet 1 b-tag',\n",
    "    'jet 2 pt', 'jet 2 eta', 'jet 2 phi', 'jet 2 b-tag',\n",
    "    'jet 3 pt', 'jet 3 eta', 'jet 3 phi', 'jet 3 b-tag',\n",
    "    'jet 4 pt', 'jet 4 eta', 'jet 4 phi', 'jet 4 b-tag',\n",
    "    'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb'\n",
    "]\n",
    "\n",
    "for x in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=3e5):\n",
    "    y = x.pop('target')\n",
    "    y_pred = model.predict_proba_many(x)\n",
    "    model.learn_many(x, y)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are familiar with scikit-learn, you might be aware that [some](https://scikit-learn.org/dev/computing/scaling_strategies.html#incremental-learning) of their estimators have a `partial_fit` method, which is similar to river's `learn_many` method. Here are some advantages that river has over scikit-learn:\n",
    "\n",
    "- We guarantee that river's is just as fast, if not faster than scikit-learn. The differences are negligeable, but are slightly in favor of river.\n",
    "- We take as input dataframes, which allows us to name each feature. The benefit is that you can add/remove/permute features between batches and everything will keep working.\n",
    "- Estimators that support mini-batches also support single instance learning. This means that you can enjoy the best of both worlds. For instance, you can train with mini-batches and use `predict_one` to make predictions."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that you can check which estimators can process mini-batches programmatically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-12-04T17:48:53.690630Z",
     "iopub.status.busy": "2023-12-04T17:48:53.690343Z",
     "iopub.status.idle": "2023-12-04T17:48:53.852127Z",
     "shell.execute_reply": "2023-12-04T17:48:53.850613Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LocalOutlierFactor\n",
      "OneClassSVM\n",
      "MiniBatchClassifier\n",
      "MiniBatchRegressor\n",
      "MiniBatchSupervisedTransformer\n",
      "MiniBatchTransformer\n",
      "SKL2RiverClassifier\n",
      "SKL2RiverRegressor\n",
      "FuncTransformer\n",
      "Pipeline\n",
      "Select\n",
      "TransformerProduct\n",
      "TransformerUnion\n",
      "BagOfWords\n",
      "TFIDF\n",
      "LinearRegression\n",
      "LogisticRegression\n",
      "Perceptron\n",
      "OneVsRestClassifier\n",
      "BernoulliNB\n",
      "ComplementNB\n",
      "MultinomialNB\n",
      "MLPRegressor\n",
      "OneHotEncoder\n",
      "OrdinalEncoder\n",
      "StandardScaler\n"
     ]
    }
   ],
   "source": [
    "import importlib\n",
    "import inspect\n",
    "\n",
    "def can_mini_batch(obj):\n",
    "    return hasattr(obj, 'learn_many')\n",
    "\n",
    "for module in importlib.import_module('river.api').__all__:\n",
    "    if module in ['datasets', 'synth']:\n",
    "        continue\n",
    "    for name, obj in inspect.getmembers(importlib.import_module(f'river.{module}'), can_mini_batch):\n",
    "        print(name)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because mini-batch learning isn't treated as a first-class citizen, some of the river's functionalities require some work in order to play nicely with mini-batches. For instance, the objects from the `metrics` module have an `update` method that take as input a single pair `(y_true, y_pred)`. This might change in the future, depending on the demand.\n",
    "\n",
    "We plan to promote more models to the mini-batch regime. However, we will only be doing so for the methods that benefit the most from it, as well as those that are most popular. Indeed, River's core philosophy will remain to cater to single instance learning."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.9.12 ('river')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  },
  "vscode": {
   "interpreter": {
    "hash": "e6e87bad9c8c768904c061eafcb4f6739260ff8bb57f302c215ab258ded773dc"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
